Abstract

We present a family of pairwise tournaments reducing k-class classification to binary classification. These reductions are provably robust against a constant fraction of binary errors, simultaneously matching the best possible computation $O(\log k)$ and regret $O(1)$.

The construction also works for robustly selecting the best of k choices by tournament. We strengthen previous results by defeating a more powerful adversary than previously addressed while providing a new form of analysis. In this setting, the error-correcting tournament has depth $O(\log k)$ while using $O(k \log k)$ comparators, both optimal up to a small constant.

Keywords: reductions, multiclass classification, cost-sensitive learning, 
tournaments, robust search 



1. Introduction 

We consider the classical problem of multiclass classification, where given an instance $x \in X$, the goal is to predict the most likely label $y \in \{1, \ldots, k\}$, according to some unknown probability distribution.

A common general approach to multiclass learning is to reduce a multiclass problem to a set of binary classification problems [2, 7, 11, 12, 15]. This approach is composable with any binary learning algorithm, including online algorithms, Bayesian algorithms, and even humans.

A key technique for analyzing reductions is regret analysis, which bounds 
the regret of the resulting multiclass classifier in terms of the average classifi- 
cation regret on the induced binary problems. Here regret (formally defined 
in Section 2) is the difference between the incurred loss and the smallest 
achievable loss on the problem, i.e., excess loss due to suboptimal predic- 
tion. 

The most commonly applied reduction is one-against-all, which creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not; predictions are then made by evaluating each binary classifier and randomizing over those that predict "yes," or over all labels if all answers are "no."

This simple reduction is inconsistent, in the sense that given optimal (zero-regret) binary classifiers, the reduction may not yield an optimal multiclass classifier in the presence of noise. Optimizing squared loss of the binary predictions instead of the 0/1 loss makes the approach consistent, but the resulting multiclass regret scales as $\sqrt{2kr}$ in the worst case, where r is the average squared loss regret on the induced problems. The Probing reduction [16] upper bounds r by the average binary classification regret. This composition gives a consistent reduction to binary classification, but it has a square root dependence on the binary regret (which is undesirable as regrets are between 0 and 1).

The probabilistic error-correcting output code approach (PECOC) [15] reduces k-class classification to learning $O(k)$ regressors on the interval [0, 1], creating $O(k)$ binary examples per multiclass example at both training and test time, with a test time computation of $O(k^2)$. The resulting multiclass regret is bounded by $4\sqrt{r}$, removing the dependence on the number of classes k. When only a constant number of labels have non-zero probability given features, the computation can be reduced to $O(k \log k)$ per example [14].

This state of the problem raises several questions: 

1. Is there a consistent reduction from multiclass to binary classification that does not have a square root dependence on r [18]? For example, an average binary regret of just 0.01 may imply a PECOC multiclass regret of 0.4.

2. Is there a consistent reduction that requires just $O(\log k)$ computation, matching the information theoretic lower bound?

The well-known $O(\log k)$ tree reduction distinguishes between the labels using a balanced binary tree, with each non-leaf node predicting "Is the correct multiclass label to the left or not?" [10]. As shown in Section 3, this method is inconsistent.

3. Can the above be achieved with a reduction that only performs pairwise comparisons between classes?

One fear associated with the PECOC approach is that it creates binary 
problems of the form "What is the probability that the label is in a 
given random subset of labels?," which may be hard to solve. Although 
this fear is addressed by regret analysis (as the latter operates only on 
avoidable, excess loss), and is overstated in some cases [9, 14], it is still 
of some concern, especially with larger values of k. 

The error-correcting tournament family presented here answers all of these questions in the affirmative. It provides a method for multiclass prediction that is exponentially faster in k, with the resulting multiclass regret bounded by 5.5r, where r is the average binary regret, and every binary classifier logically compares two distinct class labels.

The result is based on a basic observation that if a non-leaf node fails to predict its binary label, which may be unavoidable due to noise in the distribution, nodes between this node and the root should have no preference for class label prediction. Utilizing this observation, we construct a reduction, called the filter tree, which uses $O(\log k)$ computation per multiclass example at both training and test time, and whose multiclass regret is bounded by $\log k$ times the average binary regret.
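To make the regret bounds quoted so far easier to compare, the following Python sketch evaluates each of them for a given number of classes k and average binary regret r. The function names are illustrative, and the filter tree bound assumes a balanced tree of depth $\lceil \log_2 k \rceil$; all values are worst-case guarantees taken from the text, not measured performance.

```python
import math

def one_against_all_probing_bound(k, r):
    """sqrt(2kr): squared-loss one-against-all composed with Probing."""
    return math.sqrt(2 * k * r)

def pecoc_bound(r):
    """4 * sqrt(r): PECOC, independent of k."""
    return 4 * math.sqrt(r)

def filter_tree_bound(k, r):
    """log k times r, assuming (for this sketch) a balanced tree of depth ceil(log2 k)."""
    return math.ceil(math.log2(k)) * r

def ect_bound(r):
    """5.5r: error-correcting tournament."""
    return 5.5 * r

if __name__ == "__main__":
    k, r = 1024, 0.01
    print("one-against-all + Probing:", one_against_all_probing_bound(k, r))
    print("PECOC:", pecoc_bound(r))
    print("filter tree:", filter_tree_bound(k, r))
    print("error-correcting tournament:", ect_bound(r))
```

For small r the square-root bounds are much weaker than the linear ones, which is the motivation for the first question above.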

The decision process of a filter tree, viewed bottom up, can be viewed as a single-elimination tournament on a set of k players. Using multiple independent single-elimination tournaments is of no use as it does not affect the average regret of an adversary controlling the binary classifiers. Somewhat surprisingly, it is possible to have $\log k$ complete single-elimination tournaments between k players in $O(\log k)$ rounds, with no player playing twice in the same round. An error-correcting tournament first pairs labels in such simultaneous single-elimination tournaments, followed by a final carefully weighted single-elimination tournament that decides among the $\log k$ winners of the first phase. As for the filter tree, test time evaluation can start at the root and proceed to a multiclass label with $O(\log k)$ computation.

This construction is also useful for the problem of robust search, yielding the first algorithm which allows the adversary to err a constant fraction of the time in the "full lie" setting [17], where a comparator can missort any comparison. Previous work either applied to the "half lie" case where a comparator can fail to sort but cannot actively missort [6, 20], or to a "full lie" setting where an adversary has a fixed known bound on the number of lies [17] or a fixed budget on the fraction of errors so far [5, 3]. Indeed, it might even appear impossible to have an algorithm robust to a constant fraction of full lie errors since an error can always be reserved for the last comparison. Repeating the last comparison $O(\log k)$ times defeats this strategy.

The result here is also useful for the actual problem of tournament construction in games with real players. Our analysis does not assume that errors are i.i.d. [8], or have known noise distributions [1] or known outcome distributions given player skills [13]. Consequently, the tournaments we construct are robust against severe bias such as a biased referee or some forms of bribery and collusion. Furthermore, the tournaments we construct are shallow, requiring fewer rounds than m-elimination bracket tournaments, which do not satisfy the guarantee provided here. In an m-elimination bracket tournament, bracket i is a single-elimination tournament on all players except the winners of brackets $1, \ldots, i - 1$. After the bracket winners are determined, the player winning the last bracket m plays the winner of bracket $m - 1$ repeatedly until one player has suffered m losses (they start with $m - 1$ and $m - 2$ losses respectively). The winner moves on to pair against the winner of bracket $m - 2$, and the process continues until only one player remains. This method does not scale well to large m, as the final elimination phase takes $\sum_{i=1}^{m} i - 1 = O(m^2)$ rounds. Even for k = 8 and m = 3, our constructions have smaller maximum depth than bracketed 3-elimination. To see that the bracketed m-elimination tournament does not satisfy our goal, note that the second-best player could defeat the first player in the first single-elimination tournament, and then once more in the final elimination phase to win, implying that an adversary need control only two matches.

Paper overview. We begin by defining the basic concepts and introducing some of the notation in Section 2. Section 3 shows that the simple divide-and-conquer tree approach is inconsistent, motivating the Filter Tree algorithm described in Section 4 (which applies to more general cost-sensitive multiclass problems). Section 5 proves that the algorithm has the best possible computational dependence, and gives two upper bounds on the regret of the returned (cost-sensitive) multiclass classifier. Subsection 5.4 presents some experimental evidence that the Filter Tree is indeed a practical approach for multiclass classification.

Section 6 presents the error-correcting tournament family parametrized by an integer $m \ge 1$, which controls the tradeoff between maximizing robustness (m large) and minimizing depth (m small). Setting m = 1 gives the Filter Tree, while $m = 4 \ln k$ gives a (multiclass to binary) regret ratio of 5.5 with $O(\log k)$ depth. Setting $m = ck$ gives a regret ratio of $3 + O(1/c)$ with depth $O(k)$. The results here provide a nearly free generalization of earlier work [6] in the robust search setting, to a more powerful adversary that can missort as well as fail to sort.

Section 7 gives an algorithm-independent lower bound of 2 on the regret ratio for large k. When the number of calls to a binary classifier is independent (or nearly independent) of the label predicted, we strengthen this lower bound to 3 for large k.

2. Preliminaries 

Let D be the underlying distribution over $X \times Y$, where X is some observable feature space and $Y = \{1, \ldots, k\}$ is the label space. The error rate of a classifier $f : X \to Y$ on D is given by

$$\mathrm{err}(f, D) = \Pr_{(x,y) \sim D}[f(x) \ne y].$$

The multiclass regret of f on D is defined as

$$\mathrm{reg}(f, D) = \mathrm{err}(f, D) - \min_g \mathrm{err}(g, D).$$

The algorithms here extend to the cost-sensitive case, where the underlying distribution D is over $X \times [0, 1]^k$. The expected cost of a classifier $f : X \to Y$ on D is

$$\ell(f, D) = \mathbf{E}_{(x,c) \sim D}\left[c_{f(x)}\right].$$

Here $c \in [0, 1]^k$ gives the cost of each of the k choices for x. As in the multiclass case, the cost-sensitive regret of f on D is defined as

$$\mathrm{creg}(f, D) = \ell(f, D) - \min_g \ell(g, D).$$
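As a small illustration of these definitions, the following Python sketch computes the error rate and the multiclass regret of a classifier on a finite distribution. The dictionary encoding of D and the function names are assumptions made for this example only.

```python
from fractions import Fraction

def err(f, D):
    """Error rate of f on a finite distribution D given as {(x, y): probability}."""
    return sum(p for (x, y), p in D.items() if f(x) != y)

def min_err(D):
    """Smallest achievable error: for each x, predict the label with the largest mass."""
    best_mass = {}
    for (x, y), p in D.items():
        best_mass[x] = max(best_mass.get(x, 0), p)
    return sum(D.values()) - sum(best_mass.values())

def reg(f, D):
    """Multiclass regret: excess error over the best possible classifier."""
    return err(f, D) - min_err(D)

# Toy distribution over two instances and two labels (hypothetical numbers).
D = {
    ("a", 1): Fraction(3, 10), ("a", 2): Fraction(1, 5),
    ("b", 1): Fraction(1, 10), ("b", 2): Fraction(2, 5),
}
always_one = lambda x: 1
print(err(always_one, D), min_err(D), reg(always_one, D))
```

Regret subtracts the unavoidable error caused by label noise, so a classifier that predicts the most likely label for every x has regret zero even when its error rate is large.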

3. Inconsistency of Divide and Conquer Trees 

One standard approach for reducing multiclass learning to binary learn- 
ing is to split the set of labels in half, learn a binary classifier to distinguish 
between the two subsets, and repeat recursively until each subset contains 
one label. Multiclass predictions are made by following a chain of classifica- 
tions from the root down to the leaves. 

This tree reduction transforms D into a distribution $D_T$ over binary labeled examples by drawing a multiclass example (x, y) from D and a random non-leaf node i, and outputting instance (x, i) with label 1 if y is in the left subtree of node i, and 0 otherwise. A binary classifier f for this induced problem gives a multiclass classifier T(f), via a chain of binary predictions starting from the root.

Figure 1: Filter Tree. Each node predicts whether the left or the right input label is more likely, conditioned on a given $x \in X$. The root node predicts the best label for x.

The following theorem gives an example of a multiclass problem such 
that even if we have an optimal classifier for the induced binary problem at 
each node, the tree reduction does not yield an optimal multiclass predictor. 

Theorem 1. For all $k \ge 3$, for all binary trees over the labels, there exists a multiclass distribution D such that $\mathrm{reg}(T(f^*), D) > 0$ for any $f^* = \arg\min_f \mathrm{err}(f, D_T)$.

Proof: Find a node with one subset corresponding to two labels and the other subset corresponding to a single label. (If the tree is perfectly balanced, simply let D assign probability 0 to one of the labels.) Since we can freely rename labels without changing the underlying problem, let the first two labels be 1 and 2, and the third label be 3.

Fix any $\epsilon \in (0, 1/12)$. Choose D with the property that labels 1 and 2 each have a $\frac{1}{4} + \epsilon$ chance of being drawn given x, and label 3 is drawn with the remaining probability of $\frac{1}{2} - 2\epsilon$. Under this distribution, the fraction of examples for which label 1 or 2 is correct is $\frac{1}{2} + 2\epsilon$, so any minimum error rate binary predictor must choose either label 1 or label 2. Each of these choices has an error rate of $\frac{3}{4} - \epsilon$. The optimal multiclass predictor chooses label 3 and suffers an error rate of $\frac{1}{2} + 2\epsilon$, implying that the regret of the tree classifier based on the optimal binary classifier is $\frac{1}{4} - 3\epsilon$, which is strictly greater than 0 as $\epsilon < 1/12$. □
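The arithmetic in this proof can be checked with exact fractions. The sketch below is only an illustration: the function name is hypothetical, and it hard-codes the three-label subtree used in the proof (labels 1 and 2 on one side of the node, label 3 on the other).

```python
from fractions import Fraction

def tree_regret(eps):
    """Regret of the tree reduction with optimal binary classifiers on the proof's distribution."""
    p = {1: Fraction(1, 4) + eps, 2: Fraction(1, 4) + eps, 3: Fraction(1, 2) - 2 * eps}
    # The optimal binary classifier at the node follows the side with more probability mass.
    if p[1] + p[2] > p[3]:
        tree_label = 1 if p[1] >= p[2] else 2
    else:
        tree_label = 3
    tree_err = 1 - p[tree_label]
    best_err = 1 - max(p.values())  # error of the optimal multiclass predictor
    return tree_label, tree_err - best_err

eps = Fraction(1, 24)
label, regret = tree_regret(eps)
print(label, regret)
```

For any eps strictly between 0 and 1/12 the tree commits to label 1 or 2 even though label 3 is individually the most likely, which is exactly the inconsistency the theorem states.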

4. The Filter Tree Algorithm 

The Filter Tree algorithm, illustrated by Figure 1, is equivalent to a single-elimination tournament on the set of labels, structured as a binary tree T over the labels. In the first round, the labels are paired according to the lowest level of the tree, and a classifier is trained for each pair to predict which of the two labels is more likely. (The labels that don't have a pair in a given round win that round for free.) The winning labels from the first round are in turn paired in the second round, and a classifier is trained to predict whether the winner of one pair is more likely than the winner of the other. The process of training classifiers to predict the best of a pair of winners from the previous round is repeated until the root classifier is trained.

Algorithm 1 Filter tree training (multiclass training set S, binary learning algorithm Learn)

Define $y_u = 1$ if label y is in the left subtree of node u; otherwise $y_u = 0$.
for each non-leaf node n in order from leaves to root do
    Set $S_n = \emptyset$
    for each $(x, y) \in S$ such that $y \in L(T_n)$ and all nodes u on the path $n \leadsto y$ predict $y_u$ given x do
        add $(x, y_n)$ to $S_n$
    end for
    Let $f_n$ = Learn($S_n$)
end for
return $f = \{f_n\}$

The key trick in the training stage (Algorithm 1) is to form the right training set at each interior node. We use $T_n$ to denote the subtree of T rooted at node n, and $L(T)$ to denote the set of leaves in the tree T. A training example for node n is formed conditioned on the predictions of classifiers in the round before it. Thus the learned classifiers from the first level of the tree are used to "filter" the distribution over examples reaching the second level of the tree.

Given x and classifiers at each node, every edge in T is identified with a unique label. The optimal decision at any non-leaf node is to choose the input edge (label) that is more likely according to the true conditional probability. This can be done by using the outputs of classifiers in the round before it as a filter during the training process: For each observation, we set the label to 0 if the left parent's output matches the multiclass label, 1 if the right parent's output matches, and reject the example otherwise.
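As a rough illustration of the training procedure and of test-time prediction, the following Python sketch trains a filter tree on a toy sample. It follows the convention of Algorithm 1 (binary label 1 means the left input wins). The nested-tuple tree encoding, the helper names (build_tree, winner, train, majority_learner), and the lookup-table learner are assumptions made for this sketch and are not part of the paper.

```python
from collections import Counter, defaultdict

def build_tree(labels):
    """Balanced binary tree as nested tuples; a leaf is a bare label."""
    if len(labels) == 1:
        return labels[0]
    mid = (len(labels) + 1) // 2
    return (build_tree(labels[:mid]), build_tree(labels[mid:]))

def winner(node, x, f):
    """Label output by the subtree rooted at node, following the classifiers in f."""
    while isinstance(node, tuple):
        node = node[0] if f[node](x) == 1 else node[1]
    return node

def train(node, S, learn, f):
    """Train classifiers from the leaves to the root, filtering examples at each node."""
    if not isinstance(node, tuple):
        return f
    left, right = node
    train(left, S, learn, f)
    train(right, S, learn, f)
    S_n = []
    for x, y in S:
        if winner(left, x, f) == y:      # y survived the left subtree
            S_n.append((x, 1))
        elif winner(right, x, f) == y:   # y survived the right subtree
            S_n.append((x, 0))
        # otherwise y was already eliminated and the example is rejected
    f[node] = learn(S_n)
    return f

def majority_learner(examples):
    """Toy binary learner: per-instance majority vote with a global default."""
    by_x, overall = defaultdict(Counter), Counter()
    for x, bit in examples:
        by_x[x][bit] += 1
        overall[bit] += 1
    default = overall.most_common(1)[0][0] if overall else 1
    def classify(x):
        return by_x[x].most_common(1)[0][0] if x in by_x else default
    return classify

tree = build_tree([1, 2, 3])
S = [("a", 1)] * 35 + [("a", 2)] * 25 + [("a", 3)] * 40
f = train(tree, S, majority_learner, {})
print(winner(tree, "a", f))
```

On this sample, label 3 is the most frequent (40 of 100 examples) and the trained filter tree predicts it: the root only sees examples whose label survived the first round, so it compares label 1 (35 examples) against label 3 (40 examples) rather than the pooled mass of labels 1 and 2.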

Algorithm 2 extends this idea to the cost-sensitive multiclass case where each choice has a different associated cost, as defined in Section 2. The algorithm relies upon an importance weighted binary learning algorithm Learn, which takes examples of the form (x, y, w), where $x \in X$ is a feature vector used for prediction, $y \in \{0, 1\}$ is a binary label, and $w \in [0, \infty)$ is the importance any classifier pays if it doesn't predict y on x. The importance weighted problem can be further reduced to binary classification using the Costing reduction [21], which alters the underlying distribution using rejection sampling on the importances. This is the reduction we use here.

Algorithm 2 Cost-sensitive filter tree training (cost-sensitive training set S, importance weighted binary learner Learn)

1: for each non-leaf node n in the order from leaves to root do
2:     Set $S_n = \emptyset$
3:     for each example $(x, c_1, \ldots, c_k) \in S$ do
4:         Let a and b be the two classes input to n
5:         $S_n \leftarrow S_n \cup \{(x, \arg\min\{c_a, c_b\}, |c_a - c_b|)\}$, where the importance weight is $w_n(x, c) = |c_a - c_b|$
6:     end for
7:     Let $f_n$ = Learn($S_n$)
8: end for
9: return $f = \{f_n\}$

The testing algorithm is the same for both multiclass and cost-sensitive variants, and is very simple: Given a test example $x \in X$, we output the label y such that every classifier on the path from the root to y prefers y.
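The two ingredients of the cost-sensitive variant can be sketched as follows: forming the importance weighted example at a node (line 5 of Algorithm 2), and a rejection-sampling step in the spirit of Costing that turns weighted examples into unweighted ones. The function names, the dictionary of costs, and the normalizer w_max (an assumed upper bound on the weights) are choices made for this illustration; Costing as described in [21] may differ in details such as how resampled sets are combined.

```python
import random

def node_example(x, a, b, costs):
    """Importance weighted example at a node whose inputs are classes a and b.

    The label is the cheaper of the two classes; the weight is |c_a - c_b|.
    """
    cheaper = a if costs[a] <= costs[b] else b
    return (x, cheaper, abs(costs[a] - costs[b]))

def rejection_sample(weighted_examples, w_max, rng):
    """Keep (x, y) with probability w / w_max, dropping the weight."""
    kept = []
    for x, y, w in weighted_examples:
        if rng.random() < w / w_max:
            kept.append((x, y))
    return kept

rng = random.Random(0)
example = node_example("x1", 1, 2, {1: 0.25, 2: 0.75})
sample = rejection_sample([example] * 1000, w_max=1.0, rng=rng)
print(example, len(sample))
```

Examples with zero weight (the two inputs cost the same) are never kept, so nodes where the decision does not matter contribute nothing to the induced binary problem.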

5. Filter Tree Analysis 

Before analyzing the regret of the algorithm, we note its computational 
characteristics. 

5.1. Computational Complexity 

Since the algorithm is a reduction, we count the computational complexity of the reduction itself, assuming that oracle calls take unit time.

Algorithm 1 requires $O(\log k)$ computation per multiclass example, by searching for the correct leaf in $O(\log k)$ time, then filtering back toward the root. This matches the information theoretic lower bound since simply reading one of k labels requires $\lceil \log_2 k \rceil$ bits.

Algorithm 2 requires $O(k)$ computation per cost-sensitive example, because there are $k - 1$ nodes, each requiring constant computation per example. Since any method must read the k costs, this bound is tight.

Testing requires $O(\log k)$ computation per example to descend a binary tree. Any method must write out $\lceil \log_2 k \rceil$ bits to specify its prediction.

5.2. Regret Analysis 

Algorithm 2 transforms each cost-sensitive multiclass example (line 3) 
into importance weighted binary labeled examples (line 5), one for every non- 
leaf node n in the tree. This process implicitly transforms the underlying 
distribution D over cost-sensitive multiclass examples into a distribution Dn 
over importance weighted binary examples at each n. 

We can further reduce from importance weighted binary classification to binary classification using the Costing reduction [21], which alters each $D_n$ using rejection sampling on the importance weights. This composition further transforms $D_n$ into a distribution $D'_n$ over binary examples.

Let $f_n$ be a classifier for the binary classification problem induced at node n. The relevant quantity is the average binary regret,

$$\mathrm{reg}(f, D') = \frac{1}{\sum_n W_n} \sum_n \mathrm{reg}(f_n, D'_n)\, W_n, \qquad (1)$$

where $W_n = \mathbf{E}_{(x,c) \sim D}\, w_n(x, c)$, and $w_n(x, c)$ is the importance weight formed in line 5 of Algorithm 2 (the difference in cost between the two labels that node n chooses between on x). This quantity, which is just the average importance weighted binary regret of $f_n$ on $D_n$, is induced by the reduction (Algorithm 2).

The core theorem below relates $\mathrm{reg}(f, D')$ to the regret of the resulting cost-sensitive classifier T(f) on D. Again, given a test example $x \in X$, the classifier T(f) returns the unique label y such that every $f_n$ on the path from the root to y prefers y.

This type of analysis is similar to Boosting: At each round n, the booster creates an input distribution $D_n$ and calls a weak learning algorithm to obtain a classifier $f_n$, which has some error rate on $D_n$. The distribution $D_n$ depends on the classifiers returned by the oracle in previous rounds. The accuracy of the final classifier on the original distribution D is analyzed in terms of these error rates.

Theorem 2. For all binary classifiers f and all cost-sensitive multiclass distributions D,

$$\mathrm{creg}(T(f), D) \le \mathrm{reg}(f, D') \sum_{n \in T} W_n,$$

where $W_n = \mathbf{E}_{(x,c) \sim D}\, w_n(x, c)$, and $w_n(x, c)$ is the importance weight formed in line 5 of Algorithm 2 (the difference in cost between the two labels that node n chooses between on x).
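To see how the quantities in Definition (1) and Theorem 2 fit together, the following Python sketch computes the average binary regret from per-node pairs $(W_n, \mathrm{reg}(f_n, D'_n))$ and the resulting bound on the cost-sensitive regret. The input format and function names are assumptions made for this illustration, and the numbers are hypothetical; the sketch assumes the total weight is positive.

```python
from fractions import Fraction

def average_binary_regret(nodes):
    """Definition (1): importance-weighted average of the per-node binary regrets.

    nodes is a list of (W_n, reg_n) pairs with positive total weight.
    """
    total_weight = sum(W for W, _ in nodes)
    return sum(W * r for W, r in nodes) / total_weight

def theorem2_bound(nodes):
    """Right-hand side of Theorem 2: average binary regret times the total weight."""
    return average_binary_regret(nodes) * sum(W for W, _ in nodes)

nodes = [
    (Fraction(1, 2), Fraction(1, 10)),
    (Fraction(1, 4), Fraction(1, 5)),
    (Fraction(3, 4), Fraction(0)),
]
print(average_binary_regret(nodes), theorem2_bound(nodes))
```

The bound equals the plain sum of $W_n \cdot \mathrm{reg}(f_n, D'_n)$ over the nodes, so nodes with small expected importance contribute little even when their classifiers are poor.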

Before proving the theorem, we state the corollary for multiclass classification.

Corollary 3. For all binary classifiers f and multiclass distributions D,

$$\mathrm{reg}(T(f), D) \le d \cdot \mathrm{reg}(f, D'),$$

where d is the depth of the tree T.

Since all importance weights are either 0 or 1, we don't need to apply Costing in the multiclass case. The proof of the corollary given the theorem is simple since for any (x, y), the induced (x, c) has at most one node per level with induced importance weight 1; all other importance weights are 0. Therefore, $\sum_n w_n(x, c) \le d$.

Theorem 4 provides an alternative bound for cost-sensitive classification. 
It is the first known bound giving a worst-case dependence of less than k. 

Theorem 4. For all binary classifiers f and all cost-sensitive k-class distributions D,

$$\mathrm{creg}(T(f), D) \le k \cdot \mathrm{reg}(f, D') / 2,$$

where T(f) and D' are as defined above.

A simple example in Section 5.3 shows that this bound is essentially tight. 

The proof of Theorem 2 uses the following folk theorem from [21]. 

Theorem 5. (Translation Theorem [21]) For any importance-weighted distribution P, there exists a constant $\langle c \rangle = \mathbf{E}_{(x,y,c) \sim P}[c]$ such that for any classifier f,

$$\mathbf{E}_{(x,y,c) \sim P}\left[c \cdot \mathbf{1}(f(x) \ne y)\right] = \langle c \rangle\, \mathbf{E}_{(x,y,c) \sim P'}\left[\mathbf{1}(f(x) \ne y)\right],$$

where $P'(x, y, c) = \frac{c}{\langle c \rangle} P(x, y, c)$.

Thus choosing f to minimize the error rate under P' is equivalent to choosing f to minimize the expected cost under P. The Costing [21] reduction uses rejection sampling according to the weights to draw examples from P' given examples drawn from P.
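The identity in the Translation Theorem can be verified directly on a small finite distribution. The following Python sketch does so with exact fractions; the dictionary encoding of P, the function names, and the toy numbers are assumptions made for this illustration.

```python
from fractions import Fraction

def expected_cost(f, P):
    """E_{(x,y,c)~P}[c * 1(f(x) != y)] for finite P given as {(x, y, c): probability}."""
    return sum(p * c for (x, y, c), p in P.items() if f(x) != y)

def error_rate(f, P):
    """E_{(x,y,c)~P}[1(f(x) != y)]."""
    return sum(p for (x, y, c), p in P.items() if f(x) != y)

def reweight(P):
    """Return (<c>, P') where P'(x, y, c) = c / <c> * P(x, y, c)."""
    mean_c = sum(p * c for (x, y, c), p in P.items())
    P_prime = {key: key[2] * p / mean_c for key, p in P.items()}
    return mean_c, P_prime

P = {
    ("a", 1, Fraction(1, 2)): Fraction(1, 4),
    ("a", 0, Fraction(1, 1)): Fraction(1, 4),
    ("b", 1, Fraction(1, 4)): Fraction(1, 2),
}
f = lambda x: 1
mean_c, P_prime = reweight(P)
print(expected_cost(f, P), mean_c * error_rate(f, P_prime))
```

Both printed quantities agree, as the theorem requires, and P' sums to one because the reweighting divides by the mean importance.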

The remainder of this section proves Theorems 2 and 4. 
Proof of Theorem 2: It is sufficient to prove the claim for any $x \in X$ because that implies that the result holds for all expectations over x.

Conditioned on the value of x, each label y has a distribution over costs with an expected value of $\mathbf{E}_{c \sim D|x}[c_y]$. The zero regret cost-sensitive classifier predicts according to $\arg\min_y \mathbf{E}_{c \sim D|x}[c_y]$. Suppose that T(f) predicts y' on x, inducing cost-sensitive regret

$$\mathrm{creg}(y', D \mid x) = \mathbf{E}_{c \sim D|x}[c_{y'}] - \min_y \mathbf{E}_{c \sim D|x}[c_y].$$

First, we show that the sum over the binary problems of the importance weighted regret is at least $\mathrm{creg}(y', D \mid x)$, using induction starting at the leaves. The induction hypothesis is that the sum of the regrets of importance-weighted binary classifiers in any subtree bounds the regret of the subtree output.

For node n, each importance weighted binary decision between class a and class b has an importance weighted regret which is either 0 or $r_n = |\mathbf{E}_{c \sim D|x}[c_a - c_b]| = |\mathbf{E}_{c \sim D|x}[c_a] - \mathbf{E}_{c \sim D|x}[c_b]|$, depending on whether the prediction is correct or not.

Assume without loss of generality that the predictor outputs class b. The regret of the subtree $T_n$ rooted at n is given by

$$r_{T_n} = \mathbf{E}_{c \sim D|x}[c_b] - \min_{y \in L(T_n)} \mathbf{E}_{c \sim D|x}[c_y].$$

As a base case, the inductive hypothesis is trivially satisfied for trees with one label. Inductively, assume that $\sum_{n' \in L} r_{n'} \ge r_L$ and $\sum_{n' \in R} r_{n'} \ge r_R$ for the left subtree L of n (providing a) and the right subtree R (providing b).

There are two possibilities. Either the minimizer comes from the leaves of L or the leaves of R. The second possibility is easy since we have

$$r_{T_n} = \mathbf{E}_{c \sim D|x}[c_b] - \min_{y \in L(R)} \mathbf{E}_{c \sim D|x}[c_y] = r_R \le \sum_{n' \in R} r_{n'} \le \sum_{n' \in T_n} r_{n'},$$

proving the induction. For the first possibility,

$$\begin{aligned}
r_{T_n} &= \mathbf{E}_{c \sim D|x}[c_b] - \min_{y \in L(L)} \mathbf{E}_{c \sim D|x}[c_y] \\
&= \mathbf{E}_{c \sim D|x}[c_b] - \mathbf{E}_{c \sim D|x}[c_a] + \mathbf{E}_{c \sim D|x}[c_a] - \min_{y \in L(L)} \mathbf{E}_{c \sim D|x}[c_y] \\
&= \mathbf{E}_{c \sim D|x}[c_b] - \mathbf{E}_{c \sim D|x}[c_a] + r_L \\
&\le r_n + \sum_{n' \in L} r_{n'} \le \sum_{n' \in T_n} r_{n'},
\end{aligned}$$

which completes the induction. The inductive hypothesis for the root is that $\mathrm{creg}(y', D \mid x) \le \sum_{n \in T} r_n$.

Using the folk theorem from [21] (Theorem 5 in this paper), each $r_n$ is bounded by

$$r_n \le W_n\, \mathrm{reg}(f_n, D'_n).$$

Plugging this in and using Definition (1), we get the theorem. □

The proof of Theorem 4 makes use of the following lemma. Consider a filter tree T on k labels, evaluated on a cost-sensitive multiclass example with cost vector $c \in [0, 1]^k$. Let $S_T$ be the sum of importances over all nodes in T, and $I_T$ be the sum of importances over the nodes where the class with the larger cost was selected for the next round. Let $c_T$ denote the cost of the winner chosen by T.

Lemma 6. For any $c \in [0, 1]^k$, $S_T + c_T \le I_T + \frac{k}{2}$.

Proof: The inequality follows by induction, the result being immediate when k = 2. Assume that the claim holds for the two subtrees, L and R, providing their respective inputs l and r to the root of T, and T outputs r without loss of generality. Using the inductive hypotheses for L and R, we get

$$S_T + c_T = S_L + S_R + |c_r - c_l| + c_r \le I_L + I_R + \frac{k}{2} - c_l + |c_r - c_l|.$$

If $c_r \ge c_l$, we have $I_T = I_L + I_R + (c_r - c_l)$, and

$$S_T + c_T \le I_T + \frac{k}{2} - c_l \le I_T + \frac{k}{2},$$

as desired. If $c_r < c_l$, we have $I_T = I_L + I_R$ and $S_T + c_T \le I_T + \frac{k}{2} - c_r \le I_T + \frac{k}{2}$, completing the proof. □
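The lemma can be spot-checked numerically. The sketch below plays complete single-elimination tournaments on random cost vectors with arbitrary (here random) decisions at every node and checks the inequality with exact fractions. It assumes k is a power of two so that every round pairs all remaining classes, matching the base case k = 2 of the induction; the function names and the random test harness are illustrative only.

```python
import random
from fractions import Fraction

def play(costs, rng):
    """Play one tournament on the given costs; return (S_T, I_T, c_T)."""
    alive = list(costs)
    S = Fraction(0)  # sum of importances over all nodes
    I = Fraction(0)  # sum of importances where the costlier class advanced
    while len(alive) > 1:
        nxt = []
        for a, b in zip(alive[0::2], alive[1::2]):
            w = abs(a - b)
            chosen = a if rng.random() < 0.5 else b  # arbitrary decision
            S += w
            if chosen == max(a, b):
                I += w  # contributes 0 when a == b
            nxt.append(chosen)
        alive = nxt
    return S, I, alive[0]

def lemma6_holds(k, trials, seed):
    """Check S_T + c_T <= I_T + k/2 on random instances with k a power of two."""
    rng = random.Random(seed)
    for _ in range(trials):
        costs = [Fraction(rng.randint(0, 10), 10) for _ in range(k)]
        S, I, c = play(costs, rng)
        if S + c > I + Fraction(k, 2):
            return False
    return True

print(all(lemma6_holds(k, 2000, seed=0) for k in (2, 4, 8, 16)))
```

A random check of this kind does not replace the proof; it is only a sanity test that the reconstructed statement of the lemma is consistent with the definitions of $S_T$, $I_T$, and $c_T$.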

Proof of Theorem 4: Fix $(x, c) \in X \times [0, 1]^k$ and take the expectation over the draw of (x, c) from D as the last step.

Consider a filter tree T evaluated on (x, c) using a given binary classifier f. As before, let $S_T$ be the sum of importances over all nodes in T, and $I_T$ be the sum of importances over the nodes where f made a mistake. Recall that the regret of T on (x, c), denoted in the proof by $\mathrm{reg}_T$, is the difference between the cost of the tree's output and the smallest cost $c^*$. The importance-weighted binary regret of f on (x, c) is simply $I_T / S_T$. Since the expected importance is upper bounded by 1, $I_T / S_T$ also bounds the binary regret of f.

The inequality we need to prove is $\mathrm{reg}_T S_T \le \frac{k}{2} I_T$. The proof is by induction on k, the result being trivial if k = 2. Assume that the assertion holds for the two subtrees, L and R, providing their respective inputs l and r to the root of T. (The number of classes in L and R can be taken to be even, by splitting the odd class into two classes with the same cost as the split class, which has no effect on the quantities in the theorem statement.)

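Let the best cost $c^*$ be in the left subtree L. Suppose first (Case 1) that T chooses r and $c_r > c_l$. Let $w = c_r - c_l$. We have $\mathrm{reg}_L = c_l - c^*$ and $\mathrm{reg}_T = c_r - c^* = \mathrm{reg}_L + w$. The left hand side of the inequality is thus

$$\mathrm{reg}_T S_T = (\mathrm{reg}_L + w)(S_R + S_L + w) \le \cdots \le \frac{k}{2} I_T.$$

The first inequality follows from Lemma 6. The second and fourth follow from $w(\mathrm{reg}_L - c_l - c_r + w) \le 0$. The third follows from $\mathrm{reg}_L \le I_L$. The last follows from $\mathrm{reg}_L \le \frac{k}{2}$ for $k \ge 2$.

The proofs for the remaining three cases ($c_T = c_l < c_r$, $c_T = c_l > c_r$, and $c_l > c_r = c_T$) use the same machinery as the proof above.

Case 2: T outputs l, and $c_l < c_r$. In this case $\mathrm{reg}_T = \mathrm{reg}_L = c_l - c^*$. The left hand side can be rewritten as

$$\begin{aligned}
\mathrm{reg}_T S_T &= \mathrm{reg}_L (S_R + S_L + c_r - c_l) = \mathrm{reg}_L S_L + \mathrm{reg}_L (S_R + c_r - c_l) \\
&\le \mathrm{reg}_L \left(I_L + I_R - 2c_l + \frac{k}{2}\right) \le I_R + \mathrm{reg}_L \left(I_L - 2c_l + \frac{k}{2}\right) \\
&\le \cdots \le \frac{k}{2} I_T.
\end{aligned}$$

The first inequality follows from the lemma, the second from $\mathrm{reg}_L \le 1$, the third from $\mathrm{reg}_L \le I_L$, the fourth from $-c_l - c^* \le 0$, and the fifth because $I_T = I_L + I_R$.

Case 3: T outputs l, and $c_l > c_r$. We have $\mathrm{reg}_T = \mathrm{reg}_L = c_l - c^*$. The left hand side can be written as

$$\begin{aligned}
\mathrm{reg}_T S_T &= \mathrm{reg}_L (S_R + S_L + c_l - c_r) \\
&\le \frac{|L|}{2} I_L + \mathrm{reg}_L \left(I_R + \frac{k - |L|}{2} - c_r + c_l - c_r\right) \\
&\le \frac{k}{2} I_L + I_R + (c_l - 2c_r) \le \frac{k}{2}\left(I_L + I_R + (c_l - c_r)\right) = \frac{k}{2} I_T.
\end{aligned}$$

The first inequality follows from the inductive hypothesis and the lemma, the second from $\mathrm{reg}_L \le 1$ and $\mathrm{reg}_L \le I_L$, and the third from $c_r \ge 0$ and $k/2 \ge 1$.

Case 4: T outputs r, and $c_l > c_r$. Let $w = c_l - c_r$. We have $\mathrm{reg}_T = c_r - c^* = \mathrm{reg}_L - w$. The left hand side can be written as

$$\begin{aligned}
\mathrm{reg}_T S_T &= (\mathrm{reg}_L - w)(S_R + S_L + w) \\
&= \mathrm{reg}_L S_L - w S_L + (\mathrm{reg}_L - w)(S_R + w) \\
&\le \frac{|L|}{2} I_L - w\left(I_L + \frac{|L|}{2} - c_l\right) + (\mathrm{reg}_L - w)\left(I_R + c_l - 2c_r + \frac{k - |L|}{2}\right) \\
&\le \frac{k}{2} I_L - w\left(I_L + \frac{k}{2} - c_l\right) + (\mathrm{reg}_L - w)(I_R + c_l - 2c_r) \\
&\le \frac{k}{2}(I_L + I_R) - w\frac{k}{2} - w(I_L - c_l) + (\mathrm{reg}_L - w)(c_l - 2c_r).
\end{aligned}$$

The first inequality follows from the inductive hypothesis and the lemma, the second from $\mathrm{reg}_L \le I_L$, and the third from $\mathrm{reg}_T \le \frac{k}{2}$.

The last three terms are upper bounded by

$$\begin{aligned}
&-w - w\,\mathrm{reg}_L + w c_l + \mathrm{reg}_L c_l - 2c_r \mathrm{reg}_L - w c_l + 2w c_r \\
&\quad \le -w - \mathrm{reg}_L (c_r + c_l) + \mathrm{reg}_L c_l + 2w c_r \\
&\quad \le -w - (c_l - c^*) c_r + w c_r + (c_l - c_r) c_r \le 0,
\end{aligned}$$

and thus can be ignored, yielding $\mathrm{reg}_T S_T \le \frac{k}{2}(I_L + I_R) = \frac{k}{2} I_T$, which completes the last case. Taking the expectation over (x, c) completes the proof. □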

5.3. Tightness of Theorem 4

The following simple example shows that the theorem is essentially tight. Let k be a power of two, and let every label have cost 0 if it is even, and 1 otherwise. The tree structure is a complete binary tree of depth $\log k$ with the nodes being paired in the order of their labels. Suppose that all pairwise classifications are correct, except that one class with cost 1 (say, class 1) wins all its $\log k$ games, leading to cost-sensitive multiclass regret 1. We have $\mathrm{reg}_T = 1$, $S_T = \frac{k}{2} + \log k - 1$, and $I_T = \log k$, leading to the regret ratio $\mathrm{reg}_T S_T / I_T = \Omega\left(\frac{k}{2 \log k}\right)$, almost matching the theorem's bound of $\frac{k}{2}$ on the ratio.
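The quantities in this example can be reproduced by simulating the tournament. Because the stated regret of 1 requires the winning class to have cost 1, the sketch below lets class 1 (an odd label) win every game; this choice, the function name, and the list-based pairing are assumptions made for this illustration.

```python
def tightness_example(k, bad=1):
    """Simulate the example for k a power of two; return (reg_T, S_T, I_T).

    Label i has cost 0 if i is even and 1 otherwise. Every decision is correct
    except that class `bad` (assumed odd, hence cost 1) wins all of its games.
    """
    cost = {i: (0 if i % 2 == 0 else 1) for i in range(1, k + 1)}
    alive = list(range(1, k + 1))
    S = I = 0
    while len(alive) > 1:
        nxt = []
        for a, b in zip(alive[0::2], alive[1::2]):
            w = abs(cost[a] - cost[b])
            if bad in (a, b):
                chosen = bad
            else:
                chosen = a if cost[a] <= cost[b] else b
            S += w
            if cost[chosen] > min(cost[a], cost[b]):
                I += w  # the costlier class advanced
            nxt.append(chosen)
        alive = nxt
    reg = cost[alive[0]] - min(cost.values())
    return reg, S, I

for k in (4, 8, 64):
    reg, S, I = tightness_example(k)
    print(k, reg, S, I, reg * S / I)
```

The printed ratio grows roughly like $k / (2 \log k)$ and never exceeds $k/2$, in line with Theorem 4.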
[Figure 2 panels: "Tree vs Filter Tree performance" (x-axis: Tree Error Rate) and "All Pairs vs AP-Filter Tree performance" (x-axis: All Pairs Error Rate).]

Figure 2: Error rates (in %) of Tree versus Filter Tree (top) and All-Pairs versus All-Pairs Filter Tree (bottom) on several different datasets with a decision tree or logistic regression classifier.



5.4. Experimental Results

There is a variant of the Filter Tree algorithm, which has a significant 
difference in performance in practice. Every classification at any node n 
is essentially between two labels computed at test time, implying that we 
could simply learn one classifier for every pair of labels that could reach n at 
test time. (Note that a given pair of labels can be compared only at a single 
node, namely their least common ancestor in the tree.) The conditioning 
process and the tree structure give us a better analysis than is achievable
with the All-Pairs approach [12]. This variant uses more computation and 
requires more data but often maximizes performance when the form of the 
classifier is constrained. 

We compared the performance of Filter Tree and its All-Pairs variant 
described above to the performance of All-Pairs and the Tree reduction, 
on a number of publicly available multiclass datasets [4]. Some datasets 
came with a standard training/test split: isolet (isolated letter speech 
recognition), optdigits (optical handwritten digit recognition), pendigits 
(pen-based handwritten digit recognition), satimage, and soybean. For all 
other datasets, we reported the average result over 10 random splits, with 
2/3 of the dataset used for training and 1/3 for testing. (The splits were 
the same for all methods.) 

If computation is constrained and we can afford only $O(\log k)$ computation per multiclass prediction, the Filter Tree dominates the Tree reduction, as shown in Figure 2.

If computation is relatively unconstrained, All-Pairs and the All-Pairs Filter Tree are reasonable choices. The comparison in Figure 2 shows that there the All-Pairs Filter Tree yields similar prediction performance while using only $O(k)$ computation instead of $O(k^2)$.

Test error rates using decision trees (J48) and logistic regression as binary classifier learners are reported in Table A.1, using Weka's implementation with default parameters [19]. The lowest error rate in each row is shown in bold, although in some cases the difference is insignificant.

6. Error-Correcting Tournaments 

In this section, we extend filter trees to m-elimination tournaments, also called $(m - 1)$-error-correcting tournaments. As this section builds on Sections 4 and 5, understanding them is required before reading this section. For simplicity, we work with only the multiclass case. An extension for cost-sensitive multiclass problems is possible using the importance weighting techniques of the previous section.

6.1. Algorithm Description 

An m-elimination tournament operates in two phases. 

The first phase consists of m single-elimination tournaments over the 
k labels where a label is paired against another label at most once per 
round. Consequently, only one of these single elimination tournaments has 
a simple binary tree structure; see, for example, Figure 3 for an m = 3 
elimination tournament on k = 8 labels. There is substantial freedom in 
how the pairings of the first phase are done; our bounds depend on the 
depth of any mechanism which pairs labels in m distinct single elimination 
tournaments. One such explicit mechanism is given in [6]. Note that once an 
example has lost m times, it is eliminated and no longer influences training 
at the nodes closer to the root. 

The second phase is a final elimination phase, where we select the winner 
from the m winners of the first phase. It consists of a redundant 
single-elimination tournament, where the degree of redundancy increases as 
the root is approached. To quantify the redundancy, let every subtree Q have 
a charge $c_Q$ equal to the number of leaves under the subtree. First-phase 
winners at the leaves of the final elimination tournament have charge 1. For 
any non-leaf node comparing the outputs of subtrees A and B, the importance 
weight of a binary example created at the node is set to either $c_A$ or 
$c_B$, depending on whether the label comes from B or A. In tournament 
applications, an importance weight can be expressed by playing games 
repeatedly, where the winner of A must beat the winner of B $c_B$ times to 
advance, and vice versa. When the two labels compared are the same, the 
importance weight is set to 0, indicating there is no preference in the 
pairing amongst the two choices. 

Figure 3: An example of a 3-elimination tournament on k = 8 players. There 
are m = 3 distinct single-elimination tournaments in the first phase: one in 
black, one in blue, and one in red. After that, a final elimination phase 
occurs over the three winners of the first phase. The final elimination 
tournament has an extra weighting on the nodes, detailed in the text. 
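
The charges used in the final elimination phase can be illustrated with a short sketch. It assumes the final tree is built by pairing adjacent subtrees round by round, with an unpaired subtree receiving a bye; the text does not fix this shape, and the function name is hypothetical.

```python
def final_phase_rounds(m):
    """List, for each round of the final phase, the pairs (c_A, c_B) of
    charges of the subtrees compared in that round.

    Assumption: adjacent subtrees are paired and an odd one out gets a bye.
    """
    charges = [1] * m  # each first-phase winner is a leaf with charge 1
    rounds = []
    while len(charges) > 1:
        pairs, merged = [], []
        for j in range(0, len(charges) - 1, 2):
            c_a, c_b = charges[j], charges[j + 1]
            pairs.append((c_a, c_b))
            merged.append(c_a + c_b)  # charge of the parent subtree
        if len(charges) % 2 == 1:
            merged.append(charges[-1])  # bye: charge is unchanged
        rounds.append(pairs)
        charges = merged
    return rounds
```

Under this assumed shape, `final_phase_rounds(3)` returns `[[(1, 1)], [(2, 1)]]`. In the second round the subtree with charge 2 faces a subtree with charge 1, so in the repeated-games view its winner needs one win to advance while the opponent needs two.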

6.2. Error Correcting Tournament Analysis 

A key concept throughout this section is the importance depth, defined as 
the worst-case length (number of games) of the overall tournament, where 
importance-weighted matches in the final elimination phase are played as 
repeated games. In Theorem 11 we prove a bound on the importance depth. 

The computational bound per example is essentially just the importance 
depth. 

Theorem 7. (Structural Depth Bound) For any m-elimination tournament, 
the training and test computation is O(m + ln k) per example. 

Proof: The proof is by simplification of the importance depth bound 
(Theorem 11), which bounds the sum of importance weights at all nodes in 
the tournament. 

To see that the importance depth controls the computation, first note 
that the importance depth bounds the tournament depth since all importance 
weights are at least 1. At training time, any one example is used at 
most once per tournament level starting at the leaves. At testing time, an 
unlabeled example can have its label determined by traversing the structure 
from root to leaf. □ 



6.3. Regret analysis 

Our regret theorem is the analogue of Corollary 3 for error-correcting 
tournaments, and the notation is as defined there. As in the previous 
section, the reduction transforms a multiclass distribution D into an induced 
distribution D' over binary labeled examples. As before, T(f) denotes the 
multiclass classifier induced by a given binary classifier f and tournament 
structure T. 

It is useful to have the notation $\lceil m \rceil_2$ for the smallest power 
of 2 larger than or equal to m. 

Theorem 8. (Main Theorem) For all distributions D over k-class examples, 
all binary classifiers f, and all m-elimination tournaments T, the ratio of 
reg(T(f), D) to reg(f, D') is upper bounded by 

\[
2 + \frac{\lceil m \rceil_2}{m} + \frac{k}{2m}
\quad \text{for all } m \ge 2 \text{ and } k > 2,
\]
\[
4 + \frac{2 \ln k}{m} + 2\sqrt{\frac{\ln k}{m}}
\quad \text{for all } k \le 2^{62} \text{ and } m \le 4 \log_2 k.
\]

The first case shows that a regret ratio of 3 is achievable for very large 
m. The second case is the best bound for cases of common interest. For 
m = 4 ln k it gives a ratio of 5.5. 

Proof: The proof holds for each input x, and hence in expectation over x. 

Fix x, and let $p_y = D(y \mid x)$ for $y \in \{1, \ldots, k\}$. We can 
define the regret of any label y as $r_y = p^* - p_y$, where 
$p^* = \max_{a \in \{1, \ldots, k\}} p_a$. 

The regret of a node n comparing labels a and b from subtrees A and 
B, and outputting a, is 

\[ r_n = c_B (p_b - p_a)_+ , \]

where we use the notation $(z)_+ = \max(z, 0)$. Thus $r_n$ is 0 if n outputs 
the more likely label. If n is in a first-phase tournament, 
$r_n = (p_b - p_a)_+$. 

Finally, the regret of a subtree T is defined as $r_T = \sum_{n \in T} r_n$. 

The first part of the proof is by induction on the tree structure F of the 
final phase. The invariant for a subtree Q of F won by label a is 

\[ c_Q r_a \le r_Q + \sum_{w \in L(Q)} r_W , \]

where w is the winner of a first-phase single-elimination tournament W. 

When Q is a leaf w of F, we have $c_Q r_w = r_w \le r_W$, where the 
inequality is from Corollary 3, noting that the depth of W times the average 
regret over the nodes in W is $r_W$. 

Assume inductively that the hypothesis holds at node n comparing labels 
a and b from subtrees A and B, and outputting a: 
$c_A r_a \le r_A + \sum_{w \in L(A)} r_W$ and 
$c_B r_b \le r_B + \sum_{w \in L(B)} r_W$. We have 
$r_Q + \sum_{w \in L(Q)} r_W \ge r_n + c_A r_a + c_B r_b$ 
by the inductive hypothesis. 

Now, there are two cases. Either $p_b \le p_a$, in which case $r_n = 0$ and 
$c_A r_a + c_B r_b \ge c_A r_a + c_B r_a = c_Q r_a$, as desired. Or 
$p_b > p_a$, in which case $r_n = c_B (p_b - p_a)$ and thus 

\[
\begin{aligned}
r_n + c_A r_a + c_B r_b
  &= c_B p_b - c_B p_a + c_A p^* - c_A p_a + c_B p^* - c_B p_b \\
  &= p^* c_Q - p_a c_Q = (p^* - p_a) c_Q = r_a c_Q ,
\end{aligned}
\]

finishing the induction. 
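
The two cases of the induction step can be checked numerically. The sketch below assumes the node regret is $c_B (p_b - p_a)_+$ for a node that outputs a, and that $c_Q = c_A + c_B$; the function names are hypothetical and the random check is only an illustration, not a substitute for the proof.

```python
import random


def induction_step_gap(p_star, p_a, p_b, c_a, c_b):
    """Return (r_n + c_A r_a + c_B r_b) - c_Q r_a for a node that outputs a."""
    r_a, r_b = p_star - p_a, p_star - p_b
    r_n = c_b * max(p_b - p_a, 0.0)  # node regret c_B (p_b - p_a)_+
    return r_n + c_a * r_a + c_b * r_b - (c_a + c_b) * r_a


def check_random_nodes(trials=1000, seed=0):
    """Check that the gap is nonnegative on random inputs (up to rounding)."""
    rng = random.Random(seed)
    for _ in range(trials):
        p_a, p_b = rng.random(), rng.random()
        # The algebra only needs p_star >= max(p_a, p_b), so p_star is not
        # forced to be a valid probability here.
        p_star = max(p_a, p_b) + rng.random()
        c_a, c_b = rng.randint(1, 8), rng.randint(1, 8)
        if induction_step_gap(p_star, p_a, p_b, c_a, c_b) < -1e-9:
            return False
    return True
```

When $p_b > p_a$ the gap is zero (the equality case of the proof); when $p_b \le p_a$ it equals $c_B (p_a - p_b) \ge 0$.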

Finally, letting y be the prediction of T(f) on x, 

\[
m \, \mathrm{reg}(T(f), D \mid x) = c_F r_y
  \le r_F + \sum_{w \in L(F)} r_W \le d \, \mathrm{reg}(f, D' \mid x),
\]

where d is the maximum importance depth. Applying the importance depth 
theorem (Theorem 11) and algebra completes the proof. □ 
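
The numerical remarks following Theorem 8 can be reproduced with a small sketch. It assumes the two cases of the bound read $2 + \lceil m \rceil_2 / m + k/(2m)$ and $4 + 2 \ln k / m + 2\sqrt{\ln k / m}$; the function names are hypothetical, and m is allowed to be real-valued purely to evaluate the formulas.

```python
import math


def ceil_pow2(m):
    """Smallest power of 2 that is at least m (for m >= 1)."""
    p = 1
    while p < m:
        p *= 2
    return p


def ratio_first_case(k, m):
    """2 + ceil_pow2(m)/m + k/(2m); assumed to apply for m >= 2 and k > 2."""
    return 2 + ceil_pow2(m) / m + k / (2 * m)


def ratio_second_case(k, m):
    """4 + 2 ln(k)/m + 2 sqrt(ln(k)/m); assumed to apply for k <= 2**62
    and m <= 4 log2(k)."""
    x = math.log(k) / m
    return 4 + 2 * x + 2 * math.sqrt(x)
```

With m = 4 ln k the second expression evaluates to 4 + 1/2 + 1 = 5.5 for any k, and for m a large power of 2 the first expression approaches 3.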

The depth bound follows from the following three lemmas. 

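Lemma 9. (First Phase Depth Bound) The importance depth of the first 
phase tournament is bounded by the minimum of 

\[
\lceil \log_2 k \rceil + m \lceil \log_2(\lceil \log_2 k \rceil + 1) \rceil,
\qquad 1.5 \lceil \log_2 k \rceil + 3m + 1,
\qquad \lceil k/2 \rceil + 2m,
\]

and, for $k \le 2^{62}$ and $m \le 4 \log_2 k$, 

\[ 2(m-1) + \ln k + \sqrt{\ln k}\,\sqrt{\ln k + 4(m-1)} . \]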

Proof: The depth of the first phase is bounded by the classical problem 
of robust minimum finding with low depth. The first three cases hold be- 
cause any such construction upper bounds the depth of an error-correcting 
tournament, and one such construction has these bounds [6]. 

For the fourth case, we construct the depth bound by analyzing a 
continuous relaxation of the problem. The relaxation allows the number of 
labels remaining in each single-elimination tournament of the first phase to 
be broken into fractions. Relative to this version, the actual problem has 
two important discretizations: 

1. When a single-elimination tournament has only a single label remaining, 
it enters the next single-elimination tournament. This can have 
the effect of decreasing the depth compared to the continuous relaxation. 

2. When a single-elimination tournament has an odd number of labels 
remaining, the odd label does not play that round. Thus the num- 
ber of players does not quite halve, potentially increasing the depth 
compared to the continuous relaxation. 

In the continuous version, tournament i on round d has 
$\binom{d}{i-1} k / 2^d$ labels, where the first tournament corresponds to 
i = 1. Consequently, the number of labels remaining in any of the 
tournaments is $\frac{k}{2^d} \sum_{i=1}^{m} \binom{d}{i-1}$. We can get an 
estimate of the depth by finding the value of d such that this number is 1. 

This value of d can be found using the Chernoff bound. The probability 
that a coin with bias 1/2 has m - 1 or fewer heads in d coin flips is 
bounded by $e^{-2d \left( \frac{1}{2} - \frac{m-1}{d} \right)^2}$, and the 
probability that this occurs in k attempts is bounded by k times that. 
Setting this value to 1, we get 
$\ln k = 2d \left( \frac{1}{2} - \frac{m-1}{d} \right)^2$. Solving the 
equation for d gives 
$d = 2(m-1) + \ln k + \sqrt{4(m-1) \ln k + (\ln k)^2}$. This last formula was 
verified computationally for $k \le 2^{62}$ and $m \le 4 \log_2 k$ by 
discretizing k into factors of 2 and running a simple program to keep track 
of the number of labels in each tournament at each level. For 
$k \in \{2^{l-1} + 1, \ldots, 2^{l}\}$, we used a pessimistic value of 
$k = 2^{l-1} + 1$ in the above formula to compute the bound, and compared it 
to the output of the program for $k = 2^{l}$. □ 
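
The closed form for d can be checked against the equation it solves. The sketch below assumes the relaxed label count is $(k/2^d) \sum_{i=1}^{m} \binom{d}{i-1}$ and the Chernoff exponent is $2d(1/2 - (m-1)/d)^2$; it is not the verification program mentioned in the proof, and the function names are hypothetical.

```python
import math


def relaxed_label_count(k, m, d):
    """Labels still in play after d rounds in the continuous relaxation:
    (k / 2^d) * sum_{i=1..m} C(d, i-1), for an integer round d."""
    return k * sum(math.comb(d, i - 1) for i in range(1, m + 1)) / 2 ** d


def chernoff_exponent(m, d):
    """Exponent 2d (1/2 - (m-1)/d)^2 of the Chernoff bound."""
    return 2 * d * (0.5 - (m - 1) / d) ** 2


def relaxed_depth(k, m):
    """Closed form d = 2(m-1) + ln k + sqrt(4(m-1) ln k + (ln k)^2)."""
    log_k = math.log(k)
    return 2 * (m - 1) + log_k + math.sqrt(4 * (m - 1) * log_k + log_k ** 2)
```

Substituting `relaxed_depth(k, m)` into `chernoff_exponent` recovers ln k up to rounding, which is the condition "k times the bound equals 1". The label count stays at k for the first m - 1 rounds, since no label can have lost m times yet.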

Lemma 10. (Second Phase Depth Bound) In any m-elimination tournament, 
the second phase has importance depth at most $\lceil m \rceil_2 - 1$ rounds 
for m > 1. 

Proof: When two labels are compared in round $i \ge 1$, the importance 
weight of their comparison is at most $2^{i-1}$. Thus we have 
$\sum_{i=1}^{\log_2 \lceil m \rceil_2} 2^{i-1} = \lceil m \rceil_2 - 1$. □ 
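
The geometric sum behind Lemma 10 can be spelled out in a few lines. The sketch assumes the final phase has $\log_2 \lceil m \rceil_2$ rounds and that round i contributes at most $2^{i-1}$; the function names are hypothetical.

```python
def ceil_pow2(m):
    """Smallest power of 2 that is at least m (for m >= 1)."""
    p = 1
    while p < m:
        p *= 2
    return p


def second_phase_depth_bound(m):
    """Sum the per-round weight bounds 2^(i-1) over the final-phase rounds.

    Assumption: the final phase has log2(ceil_pow2(m)) rounds.
    """
    n_rounds = ceil_pow2(m).bit_length() - 1  # exact log2 of a power of 2
    return sum(2 ** (i - 1) for i in range(1, n_rounds + 1))
```

For m = 3 there are two rounds with weight bounds 1 and 2, giving 3, which is $\lceil 3 \rceil_2 - 1$.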

Putting everything together gives the importance depth theorem. 

Theorem 11. (Importance Depth Bound) For all m-elimination tournaments, 
the importance depth is upper bounded by 

\[
\lceil \log_2 k \rceil + m \lceil \log_2(\lceil \log_2 k \rceil + 1) \rceil
  + \lceil m \rceil_2,
\]
\[ 1.5 \lceil \log_2 k \rceil + 3m + \lceil m \rceil_2, \]
\[ \lceil k/2 \rceil + 2m + \lceil m \rceil_2, \]

and, for $k \le 2^{62}$ and $m \le 4 \log_2 k$, 

\[ 2m + \lceil m \rceil_2 + 2 \ln k + 2\sqrt{m \ln k} . \]

Proof: We simply add the depths of the first and second phases from 
Lemmas 9 and 10. For the last case, we bound 
$\sqrt{\ln k + 4(m-1)} \le \sqrt{\ln k} + 2\sqrt{m}$ and eliminate 
subtractions in Lemma 10. □ 
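
To get a feel for which case of Theorem 11 is tightest for given k and m, the bounds can be evaluated directly. This is a minimal sketch: the function names are hypothetical, and the third case is taken to be $\lceil k/2 \rceil + 2m + \lceil m \rceil_2$, which is an assumed reading of the statement.

```python
import math


def ceil_log2(n):
    """Ceiling of log2(n) for an integer n >= 1, using integer arithmetic."""
    return (n - 1).bit_length()


def importance_depth_bound(k, m):
    """Smallest of the applicable upper bounds on the importance depth."""
    pow2_m = 1 << ceil_log2(m)  # smallest power of 2 that is >= m
    log_k = ceil_log2(k)
    bounds = [
        log_k + m * ceil_log2(log_k + 1) + pow2_m,
        1.5 * log_k + 3 * m + pow2_m,
        (k + 1) // 2 + 2 * m + pow2_m,  # third case: assumed reading
    ]
    # The last case is only stated for k <= 2^62 and m <= 4 log2(k).
    if k <= 2 ** 62 and m <= 4 * math.log2(k):
        ln_k = math.log(k)
        bounds.append(2 * m + pow2_m + 2 * ln_k + 2 * math.sqrt(m * ln_k))
    return min(bounds)
```

For the tournament of Figure 3 (k = 8, m = 3) the four expressions evaluate to 13, 17.5, 14, and roughly 19.2, so the first case gives the smallest bound there.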

7. Lower Bound 

All of our lower bounds hold for a somewhat more powerful adversary 
which is more natural in a game-playing tournament setting. In particular, 
we disallow reductions which use importance weighting on examples, or 
equivalently, all importance weights are set to 1. Note that we can modify 
our upper bound to obey this constraint by transforming final elimination 
comparisons with importance weight i into 2i - 1 repeated comparisons 
and using the majority vote. This modified construction has an importance 
depth which is at most m larger, implying the ratio of the adversary's and 
the reduction's regret increases by at most 1. 

The first lower bound says that for any reduction algorithm B, there 
exists an adversary A with average per-round regret r such that A can 
make B incur regret 2r even if B knows r in advance. Thus an adversary 
who corrupts half of all outcomes can force a maximally bad outcome. In the 
bounds below, $f_B$ denotes the multiclass classifier induced by a reduction 
B using a binary classifier f. 

Theorem 12. For any deterministic reduction B from k-class classification 
(k > 2) to binary classification, there exists a choice of D and f such that 
$\mathrm{reg}(f_B, D) \ge 2\,\mathrm{reg}(f, B(D))$. 

Proof: The adversary A picks any two labels i and j. All comparisons 
involving i but not j are decided in favor of i. Similarly for j. The outcome 
of comparing i and j is determined by the parity of the number of comparisons 
between i and j in some fixed serialization of the algorithm. If the parity is 
odd, i wins; otherwise, j wins. The outcomes of all other comparisons are 
picked arbitrarily. 

Suppose that the algorithm halts after some number of queries c between 
i and j. If neither i nor j wins, the adversary can simply assign probability 
1/2 to i and j. The adversary pays nothing while the algorithm suffers loss 
1, yielding a regret ratio of $\infty$. 

Assume without loss of generality that i wins. The depth of the tournament 
is either c or at least c + 1, because each label can appear at most once 
in any round. If the depth is c, then since k > 2, some label is not involved 
in any query, and the adversary can set the probability of that label to 1, 
resulting in $\rho(B) = \infty$. 

Otherwise, A can set the probability of label j to be 1 while all others 
have probability 0. The total regret of A is at most 
$\lfloor (c+1)/2 \rfloor$, while the regret of the winning label is 1. 
Multiplying by the depth bound c + 1 gives a regret ratio of at least 2. □ 
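
The counting in the last case of the proof can be illustrated with a small sketch. It assumes that comparison number t between i and j (counting from 1) goes to i when t is odd, that the reduction outputs i, that the adversary then sets D(j | x) = 1, and that the depth is exactly c + 1, the smallest depth allowed in this case. The function name is hypothetical.

```python
import math


def parity_adversary_ratio(c):
    """Regret ratio forced by the parity adversary after c comparisons
    between i and j, under the assumptions stated in the lead-in."""
    # Comparisons that i won against j; each is an error once D(j | x) = 1.
    wrong = sum(1 for t in range(1, c + 1) if t % 2 == 1)
    if wrong == 0:
        return math.inf  # the adversary pays nothing
    depth = c + 1
    # Multiclass regret is 1; average binary regret is wrong / depth.
    return depth / wrong
```

The number of wrong outcomes is the number of odd t in 1..c, which is at most (c + 1)/2, so the ratio never drops below 2 and equals 2 exactly when c is odd.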

Note that the number of rounds in the above bound can depend on A. Next, 
we show that for any algorithm B taking the same number of rounds for any 
adversary, there exists an adversary A with a regret of roughly one third, 
such that A can make B incur the maximal loss, even if B knows the power 
of the adversary. 

Lemma 13. For any deterministic reduction B to binary classification with 
number of rounds independent of the query outcomes, there exists a choice 
of D and f such that 
$\mathrm{reg}(f_B, D) \ge \left(3 - \frac{2}{k}\right) \mathrm{reg}(f, B(D))$. 

Proof: Let B take q rounds to determine the winner, for any set of query 
outcomes. We will design an adversary A which incurs regret 
$r = \frac{qk}{3k-2}$, such that A can make B incur the maximal loss of 1, 
even if B knows r. 

The adversary's query answering strategy is to answer consistently with 
label 1 winning for the first $\frac{2(k-1)}{k} r$ rounds, breaking ties 
arbitrarily. The total number of queries that B can ask during this stage is 
at most (k - 1)r since each label can play at most once in every round, and 
each query occupies two labels. Thus the total amount of regret at this point 
is at most (k - 1)r, and there must exist a label i other than label 1 with 
at most r losses. In the remaining $q - \frac{2(k-1)}{k} r = r$ rounds, A 
answers consistently with label i and all other skills being 0. 

Now if B selects label 1, A can set D(i | x) = 1 with r/q average 
regret from the first stage. If B selects label i instead, A can choose 
D(1 | x) = 1. Since the number of queries between labels i and 1 in the 
second stage is at most r, the adversary incurs average regret at most 
r/q. If B chooses any other label to be the winner, the regret ratio is 
unbounded. □ 
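
The arithmetic of the two-stage adversary can be checked exactly with rational numbers. The sketch assumes $r = qk/(3k-2)$ and a first stage of $2(k-1)r/k$ rounds, and it ignores the integrality of round counts; the function name is hypothetical.

```python
from fractions import Fraction


def lemma13_schedule(k, q):
    """Exact round counts for the two-stage adversary, ignoring integrality.

    Returns (r, first_stage_rounds, second_stage_rounds).
    """
    r = Fraction(q * k, 3 * k - 2)
    first = Fraction(2 * (k - 1), k) * r
    second = q - first
    return r, first, second
```

With these values the second stage lasts exactly r rounds, the first stage admits at most (k/2) times its length, that is (k - 1)r, queries, and the ratio q/r equals 3 - 2/k.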
                            J48                    Logistic Regression
Dataset (k)        Tree     FT     AP   APFT      Tree     FT     AP   APFT
arrhythmia (13)   37.64  36.37  34.32  34.97     55.27  55.04  40.44  34.97
audiology (24)    32.37  31.93  28.08  28.21     31.83  27.69  24.98  25.90
ecoli (8)         21.00  18.75  18.90  18.75     18.00  18.10  15.20  15.06
flare (7)         16.42  16.38  16.38  15.57     16.17  16.07  16.09  16.03
glass (6)         33.84  34.02  32.18  31.86     39.37  38.46  38.43  38.13
isolet (26)       27.30  24.60  12.40  14.60     35.30  26.50   8.40   8.40
kropt (18)        40.32  39.66  36.50  35.81     58.55  58.09  56.34  57.06
letter (25)       16.53  15.96   9.58  11.77     51.84  49.89  16.66  17.62
lymph (4)         25.22  22.28  21.83  22.28     24.32  24.20  23.86  24.07
nursery (5)        3.55   3.49   3.49   3.49      7.36   7.41   7.39   7.39
optdigits (10)    15.50  13.50  10.60  12.20     18.40  11.70   5.00   5.90
page-blocks (5)    2.99   2.84   3.00   2.95      4.06   3.31   3.12   3.21
pendigits (10)     8.00   7.60   7.00   7.60     23.40  22.40   6.10   5.10
satimage (6)      14.60  15.10  14.30  14.30     25.80  24.50  15.20  15.10
soybean (19)      15.70  13.00  13.00  13.00     16.80  16.50  13.60  13.60
vehicle (4)       30.86  31.11  31.57  28.93     21.60  21.37  20.78  20.31
vowel (11)        29.06  28.92  24.64  24.57     35.85  30.53  11.85  12.90
yeast (10)        44.04  44.21  43.99  44.06     45.13  43.66  42.28  43.26

Table A.1: Test error rates (in %) using J48 and logistic regression as binary learners. AP 
and FT stand for All-Pairs and Filter Tree, respectively. APFT is the All-Pairs variant of 
the Filter Tree. 